SUPPLEMENT ARTICLE 



Evaluation of Tuberculosis Diagnostics: 
Establishing an Evidence Base Around the Public 
Health Impact 

Richard J. Lessells, 1,2 Graham S. Cooke, 2,3 Marie-Louise Newell, 3,4 and Peter Godfrey-Faussett 1

1 Department of Clinical Research, London School of Hygiene and Tropical Medicine, and 2 Africa Centre for Health and Population Studies, University of
KwaZulu-Natal, Mtubatuba, KwaZulu-Natal, South Africa; 3 UCL Institute of Child Health, and 4 Department of Infectious Diseases, Imperial College, 
London, United Kingdom 



The limitations of existing tuberculosis diagnostic tools are significantly hampering tuberculosis control 
efforts, most noticeably in areas with high prevalence of human immunodeficiency virus (HIV) infection and 
antituberculosis drug resistance. However, renewed global interest in tuberculosis research has begun to bear 
fruit, with several new diagnostic technologies progressing through the development pipeline. There are 
significant challenges in building a sound evidence base to inform public health policies because most 
diagnostic research focuses on the accuracy of individual tests, with often significant limitations in the design, 
conduct, and reporting of diagnostic accuracy studies. Diagnostic accuracy studies may not be appropriate to 
guide public health policies, and clinical trials may increasingly be required to determine the incremental value 
and cost-effectiveness of new tools. The urgent need for new diagnostics should not distract from pursuing 
rigorous scientific evaluation focused on public health impact. 



Global control of the tuberculosis epidemic is a public 
health priority [1, 2]. The targets for reduction in tu- 
berculosis prevalence and mortality linked to the Mil- 
lennium Development Goals and enshrined in the STOP 
TB Global Plan 2006-2015 will not be achieved with 
current interventions [3, 4]. There is an acute need for 
improved tuberculosis diagnostics as one critical com- 
ponent of the public health response to the tuberculosis 
epidemic. 

The rapid growth of the human immunodeficiency 
virus (HIV) epidemic and the emergence of antitu- 
berculosis drug resistance have highlighted the major 
deficiencies in current diagnostic technologies both 
for pathogen detection and for diagnosis of drug
resistance [5]. In most high-burden countries, sputum 
smear microscopy remains the principal tool for di- 
agnosing active disease; however, operationally, its sensi- 
tivity for pulmonary tuberculosis can be as low as 20% 
[6, 7]. Sputum culture and drug susceptibility testing are 
available in certain settings, but their impact is limited by 
the long duration and complexity of the laboratory pro- 
cesses [8]. Additional challenges are faced in developing 
diagnostics for extrapulmonary tuberculosis, pediatric 
tuberculosis, and latent tuberculosis infection [9-11]. 

The STOP TB Global Plan 2006-2015 included the 
target that, "by 2010, simple, robust, affordable tech- 
nologies for use at peripheral levels of the health system 
will enable rapid, sensitive detection of active tubercu- 
losis at the first point of care" [4, p. 24]. Although this 
has not been achieved, there have been developments in 
the tuberculosis diagnostic field, and promising tech- 
nologies have entered the clinical sphere [6, 12-15]. 
Most promising has been the Xpert MTB/RIF system, 
an automated molecular test that simultaneously detects 
Mycobacterium tuberculosis and mutations associated 
with rifampicin resistance [16, 17]. It is hoped that the 
renewed global focus on tuberculosis will in the next few years
lead to the further proliferation of diagnostic technologies in 
parallel with advances in therapeutics and vaccines. 

It is the responsibility of the global scientific community to 
correctly evaluate these new technologies so that proven effective 
and cost-effective diagnostics can be adopted, thus generating 
the greatest public health impact. The importance of diagnostic 
research in the overall tuberculosis research agenda has been 
highlighted by many different groups [2, 15, 18-22]. However, 
huge gaps in funding for tuberculosis research and tuberculosis 
control remain [1, 2, 23]; this should force us to rethink how 
diagnostic research can be most effectively targeted and ratio- 
nalized to inform public health policies. 

This article focuses on the framework for evaluation of new 
diagnostics: at the outset, we look at the potential benefits of 
new diagnostics, and then we discuss different methodologies to 
evaluate diagnostic performance with a view to their ultimate 
implementation. Our focus throughout is on diagnostic tests for 
detection of active tuberculosis disease and/or drug resistance in 
high-burden countries. 

POTENTIAL IMPACT OF NEW TUBERCULOSIS 
DIAGNOSTICS 

It has been hypothesized that a test more sensitive than sputum 
microscopy for tuberculosis would be the diagnostic inter- 
vention that would alleviate the greatest burden of infectious 
disease in developing countries [24]. More specifically, one 
mathematical model of the global tuberculosis epidemic sug- 
gested that a new rapid diagnostic test with 100% sensitivity, 
100% specificity, and 100% access could prevent 625 000 deaths 
annually (equivalent to 36% of all tuberculosis-related deaths) 
[25]. Other models have derived fairly consistent estimates of 
mortality reductions of 17%-23% from a more sensitive rapid 
tuberculosis diagnostic, despite exploring different epidemics 
[26-28]. In one model, the estimated benefit in terms of mor- 
tality from a new diagnostic test was equivalent in magnitude to 
that expected from a novel vaccine or an optimized 2-month 
treatment regimen for active disease [26]. This highlights 2 im- 
portant points: (1) no single intervention will have the impact 
required to meet tuberculosis control targets; thus, scaled-up 
investment in research and implementation of diagnostics, 
drugs, and vaccines will be required; and (2) because new di- 
agnostics could have an equivalent impact to new drugs or 
vaccines, evaluation of diagnostics should be as rigorous as 
evaluation of drugs and vaccines. 

EXISTING FRAMEWORK FOR TUBERCULOSIS 
DIAGNOSTIC RESEARCH AND DEVELOPMENT 

Figure 1. Stepwise approach to evaluation of diagnostic technologies.

The fact that sputum smear microscopy remains the cornerstone
of tuberculosis diagnosis in most high-burden countries is
testament to the relative paucity of research and development in
the diagnostic arena and the failure to translate research findings 
into policy. In medicine broadly, diagnostic research tends to be 
performed in stepwise fashion, with basic science leading to 
laboratory-based performance evaluation and then to clinical 
studies (Figure 1) [29]. This structure inherently tends to ex- 
clude the perspectives of end users in the conception and de- 
velopment of diagnostics, although more recently in the 
tuberculosis field, organizations have assisted this process by 
defining the ideal specifications for a point-of-care test [30]. 

In the tuberculosis field, the process of diagnostic de- 
velopment has rarely gone beyond diagnostic accuracy studies to 
assess the impact in clinical practice on clinical decision making, 
patient outcomes, and health system costs [13, 31, 32]. This is in 
part explained by the fact that the regulatory framework for in 
vitro diagnostic devices usually does not require evidence be- 
yond performance data. Diagnostic accuracy studies are an 
important part of the evaluation process. However, there is 
much potential for bias in such studies, and diagnostic accu- 
racy might vary widely between different clinical settings and 
populations [33-36]. 
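
As a minimal sketch of the kind of calculation a diagnostic accuracy study reports, the following Python snippet derives sensitivity and specificity from a 2x2 table against a reference standard, together with Wilson score intervals. The counts are hypothetical and are not taken from any study cited here, and the function names are illustrative only.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def accuracy_summary(tp, fp, fn, tn):
    """Sensitivity and specificity of an index test against a reference standard."""
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity": tn / (tn + fp),
        "specificity_ci": wilson_interval(tn, tn + fp),
    }


# Hypothetical counts: 60 reference-positive and 440 reference-negative patients.
summary = accuracy_summary(tp=45, fp=5, fn=15, tn=435)
for name, value in summary.items():
    print(name, value)
```

The width of the interval around sensitivity depends on the number of reference-positive patients, which is often small in a single-site study; this is one reason why estimates from different settings can differ substantially even before bias is considered.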

Item

1 Was the spectrum of patients representative of the patients who will receive the test in practice?

2 Were selection criteria clearly described?

3 Is the reference standard likely to correctly classify the target condition?

4 Is the time period between reference standard and index test short enough to be reasonably sure that the target condition did not change between the two tests?

5 Did the whole sample, or a random selection of the sample, receive verification using a reference standard of diagnosis?

6 Did patients receive the same reference standard regardless of the index test result?

7 Was the reference standard independent of the index test (i.e. the index test did not form part of the reference standard)?

8 Was the execution of the index test described in sufficient detail to permit replication of the test?

9 Was the execution of the reference standard described in sufficient detail to permit its replication?

10 Were the index test results interpreted without knowledge of the results of the reference standard?

11 Were the reference standard results interpreted without knowledge of the results of the index test?

12 Were the same clinical data available when test results were interpreted as would be available when the test is used in practice?

13 Were uninterpretable/intermediate test results reported?

14 Were withdrawals from the study explained?

Figure 2. The Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool.

In the field of diagnostic accuracy research, there have been
certain key initiatives aimed at improving and standardizing
research methodologies and reporting: the guidelines for
diagnostic evaluation produced by the TDR Diagnostics Evaluation
Expert Panel (DEEP) [37], the Quality Assessment of Diagnostic
Accuracy Studies (QUADAS) tool [38], and the Standards for
the Reporting of Diagnostic Accuracy Studies (STARD) ini- 
tiative [39, 40]. The DEEP guidelines outline best practice in 
the design and conduct of diagnostic evaluations, with focus on 
performance characteristics and operational feasibility. QUA- 
DAS is a quality assessment tool to be used specifically for the 
assessment of diagnostic accuracy studies included in system- 
atic reviews. The tool consists of 14 items (Figure 2); the ma- 
jority involve sources of bias, with a few relating to variability 
and quality of reporting. The objective of the STARD initiative 
is to improve the quality of reporting of diagnostic accuracy 
studies. The 25-item checklist (Figure 3) allows the reader to 
judge the potential for bias (internal validity) and the gener- 
alizability and applicability (external validity) of the study. 

A systematic review that used both QUADAS and STARD 
criteria to assess tuberculosis diagnostic accuracy studies pub- 
lished during 2004-2006 showed significant deficiencies in 
methodology and reporting of studies [41]. Unfortunately, more
widespread use of the STARD system has not been apparent in 
recent years. As a further example, of the 10 published studies 
evaluating the diagnostic accuracy of the Genotype MTBDRplus 
assay (published during 2007-2010) [42-51], only one manu- 
script explicitly mentions STARD [51]. Additional efforts are 
required by researchers, research funders, journal editors, and 
policy makers to encourage the use of these tools, with the aim 
of improving the quality and validity of this element of the 
evidence base. 

THE NEED FOR HIGH-QUALITY EVIDENCE TO 
INFORM PUBLIC HEALTH POLICIES 

Public health policies and guidelines are now usually informed 
by a systematic approach to judging the relevant evidence. In the 
tuberculosis field, the World Health Organization (WHO)
convenes expert groups to assess the available evidence for a specific
intervention (eg, diagnostic test), and this group then presents
their findings to the WHO Strategic and Technical Advisory 
Group for Tuberculosis (STAG-TB) for consideration and en- 
dorsement. The system to assess the evidence now adopted by 
many organizations, including WHO, is the Grading of Recom- 
mendations Assessment, Development, and Evaluation (GRADE) 
system, which incorporates judgments on the quality of evidence 
(high, moderate, low, or very low) and on the strength of any 
recommendation (initially categorized as strong or weak; now 
incorporates "conditional," whereby national programs should 
consider implementation based on their own situation) [52, 53]. 

The GRADE system is based around the concept of patient- 
important outcomes, and as such, evidence from diagnostic 
interventions creates additional challenges. Studies using in- 
direct outcomes (eg, diagnostic accuracy studies) will usually 
provide lower-quality evidence because of the uncertainty about 
outcomes important to patients and the potential for bias [54]. 
It is important to be clear that the rating of low quality in this 
context does not necessarily imply that studies were conducted 
poorly, but that data from the study are not optimal for deriving 
public health recommendations. 

GOING BEYOND DIAGNOSTIC ACCURACY 
STUDIES: THE NEED FOR IMPACT DATA

In the STOP TB New Diagnostics Working Group blueprint
for the evaluation of diagnostics, the next step after diagnostic
accuracy studies is the demonstration study, which includes
patient outcomes (Figure 4) [55]. These demonstration studies
are designed to assess the scaled-up test performance and to
determine patient-level outcomes. This is the stage of the
evaluation process that should start to inform policy. The
blueprint states that patient-important outcomes
should be assessed (eg, time to initiation of treatment, time to
smear and/or culture conversion, and treatment outcome) and
that "these impact-related data should be compared to historical
data recorded prior to implementation of the new test in
routine clinical practice" [55, p. 62]. This use of historical data
is problematic as a method of assessing any health care
intervention and would not generally be accepted by regulatory
bodies in the field of drugs or vaccines [56]. It is difficult to be
sure that any comparison is fair; there are potential sources of
bias, and consequently, the risk is that the value of the
intervention can be exaggerated.

Figure 3. Standards for the Reporting of Diagnostic Accuracy Studies (STARD) checklist.

Figure 4. The pathway for evaluation of new diagnostics (from the STOP TB New Diagnostic Working Group).

Two organizations that have been instrumental in driving 
forward development and evaluation of diagnostic technologies 
for tuberculosis are the Foundation for Innovative New Diag- 
nostics and the WHO TDR program (Special Programme for 
Research and Training in Tropical Diseases). Demonstration 
studies are key elements of their tuberculosis projects, which aim 
to determine the feasibility, impact, and cost-effectiveness of the 
diagnostic test under evaluation. The evidence from these studies 
is a key element assessed by the expert groups and reported to 
STAG-TB. If we take the example of the Genotype MTBDRplus
assay, preliminary data regarding patient-important outcomes 
from the South African demonstration projects seemed rela- 
tively disappointing because the median turnaround times did 
not meet their predefined objective of 7 days; of the patients 
with multidrug-resistant tuberculosis who were identified, only 
28% were started on appropriate therapy on the basis of the test 
result (42% had therapy delayed until results of conventional 
drug susceptibility testing were available) [57]. Although these 
results were based only on preliminary data analysis and are 
understandable during implementation of a new technology, 
there has, to our knowledge, been no further published evi- 
dence from high-burden settings on patient-important outcomes. 
However, the test has been introduced into routine practice in 
some countries, and its use is now being scaled up [58]. 

It is generally considered that the optimal methodology for 
assessing the clinical impact of any intervention, including di- 
agnostics, is the randomized controlled trial (RCT) [59-61]. 
This is the methodology least prone to bias in estimating the 
benefits and risks of any intervention. Data from RCTs can 
additionally be used to perform economic evaluation, a step of 
major importance for policy makers. The relative shortage of 
RCTs in diagnostic research, in contrast to therapeutic and 
vaccine research, is likely to be explained by a combination of 
factors: lack of emphasis on this level of evidence by
manufacturers and regulatory authorities, limited funding and poor
coordination of diagnostic research, and logistical and ethical
challenges. There are features specific to diagnostic trials that 
complicate trial design and implementation. In a tuberculosis 
diagnostic study, the population of interest might be persons 
with suspected pulmonary tuberculosis (eg, individuals with 
cough). Inevitably, the majority of participants will not have 
tuberculosis; thus, the potential effect size on the total cohort 
resulting from improved diagnosis is relatively small. However, 
we have to include the entire cohort in a trial if we want to 
capture comprehensive outcome data (to balance benefits and 
harms). 
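
The dilution described above can be made concrete with a rough sketch. The Python snippet below assumes that the new test changes the outcome only among participants who truly have tuberculosis and uses the standard normal-approximation formula for comparing two proportions. All input values (prevalence among those with suspected tuberculosis, mortality risks, power) are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def cohort_risk(prevalence, risk_tb, risk_non_tb):
    """Outcome risk in the whole enrolled cohort (TB and non-TB participants)."""
    return prevalence * risk_tb + (1 - prevalence) * risk_non_tb


def n_per_arm(p_control, p_intervention, alpha=0.05, power=0.80):
    """Participants per arm to detect a difference between two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_intervention * (1 - p_intervention)
    return math.ceil((z_alpha + z_beta) ** 2 * variance
                     / (p_control - p_intervention) ** 2)


# Hypothetical inputs: 20% of enrolled participants truly have tuberculosis;
# mortality among them falls from 20% to 12% with the new test; mortality
# among participants without tuberculosis is 5% in both arms.
prevalence = 0.20
control = cohort_risk(prevalence, risk_tb=0.20, risk_non_tb=0.05)
intervention = cohort_risk(prevalence, risk_tb=0.12, risk_non_tb=0.05)

n_tb_only = n_per_arm(0.20, 0.12)          # if only true cases could be enrolled
n_cohort = n_per_arm(control, intervention)  # whole cohort, as in a real trial

print(f"risk difference among TB cases: {0.20 - 0.12:.3f}")
print(f"risk difference in whole cohort: {control - intervention:.3f}")
print(f"per-arm sample size, TB cases only: {n_tb_only}")
print(f"per-arm sample size, whole cohort: {n_cohort}")
```

Under these assumptions, the absolute difference in the whole cohort is the difference among true cases multiplied by the prevalence, so the required sample size rises steeply as the proportion of participants with tuberculosis falls.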

To illustrate the value of well-designed RCTs in diagnostic
research, it is worthwhile to look beyond tuberculosis and
consider malaria, another global health priority. Malaria rapid
diagnostic tests (RDTs) have been shown to have good di- 
agnostic accuracy [62], and mathematical models have sug- 
gested that implementation of RDTs could lead to significant 
public health benefits in settings where malaria is endemic [63]. 
Trials were designed to assess the performance of the tests in 
a field setting and to measure the impact on health care pro- 
viders, therapeutic decisions, and patient outcomes [64-67]. 
Three of these trials showed that, despite good diagnostic ac- 
curacy, there was no reduction in incorrect antimalarial treat- 
ment with the use of RDTs [64-66]; of more concern, one trial 
even showed a significant reduction in correct antimalarial 
treatment [66]. These trials have provided vital information for 
the further development and implementation of RDTs. The 
results of these trials highlight the fact that a diagnostic test is 
only ever a vehicle to guide therapies; it is never of therapeutic 
benefit, and it is the treatment decision that will impact on 
patient outcomes. 

CONCEPTUALIZING CLINICAL TRIALS OF 
TUBERCULOSIS DIAGNOSTICS 

The first step in any trial is to determine the hypothesis that is to 
be tested because this will inform the trial design. It is important 
to consider the likely position of the new test in the diagnostic 
process. In the case of a test for active pulmonary tuberculosis,
we need to decide how the test will be introduced in the existing
diagnostic structure, which includes sputum microscopy,
sputum culture, drug-susceptibility testing, and chest radiography.
It could be proposed as a replacement for ≥1 of these tests, as an
addition to these tests, or as a means of triage, for example, to 
target sputum culture and/or drug-susceptibility testing. This 
decision is in turn likely to depend on the proposed benefits of 
the new test (eg, whether it is more rapid, more sensitive, more 
specific, less technical, safer, or less expensive). Furthermore, we 
need to consider the outcomes of interest, whether related to 
benefit or harm; these may be appropriate or inappropriate 
commencement of tuberculosis treatment, outcomes during 
treatment (smear or culture conversion), final treatment out- 
comes (cure or completion), and mortality. 

One possible reason to explain the lack of RCTs in diagnostic 
research is the perception that diagnostic tests carry minimal or 
no risk. Although the test is unlikely to harm the patient, the 
consequences of the test (eg, the therapeutic decision) may 
confer harm, as shown in the example of RDTs of malaria. What 
risks might we expect in a trial of a tuberculosis diagnostic? 
Consider a hypothetical trial comparing clinical outcomes be- 
tween a rapid molecular tuberculosis test and the standard- 
of-care diagnostic pathway (Figure 5). At a basic level, this trial 
will tell us whether the benefits from earlier correct diagnosis 
or exclusion of tuberculosis outweigh the risks from incorrect 
classification of disease (false-negative or false-positive results). 
The benefits would seem to be self-evident but need to be 
quantified. The risks are more complicated and will be context 
specific. False-negative diagnoses will result in appropriate 
treatment being withheld, with potential for poorer outcomes. 
False-positive diagnoses also carry risk, however, because alter- 
native diagnoses may not be considered and, therefore, not 
treated, and patients may be exposed to potentially toxic ther- 
apy. For diagnosis of drug resistance, the risks from incorrect 
classification are even more complicated. False-negative results 
of genotypic testing may lead to inappropriate treatment with 
first-line regimens, with consequent adverse outcomes, in- 
cluding amplification of drug resistance. False-positive results 
may lead to inappropriate treatment with multidrug-resistant 
tuberculosis regimens, with lower efficacy against sensitive 
strains and with risks of severe toxicity. 
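
To show why these risks are context specific, the following sketch tabulates the expected numbers of correct and incorrect classifications for a hypothetical test in two settings that differ only in the prevalence of tuberculosis among those tested. The sensitivity, specificity, prevalence, and cohort size are assumed values for illustration and do not describe any particular assay.

```python
def expected_classification(n, prevalence, sensitivity, specificity):
    """Expected counts of each test outcome in a cohort of n people tested."""
    with_tb = n * prevalence
    without_tb = n - with_tb
    tp = with_tb * sensitivity
    fn = with_tb - tp              # true cases missed: treatment withheld
    tn = without_tb * specificity
    fp = without_tb - tn           # people treated for a disease they do not have
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn, "ppv": tp / (tp + fp)}


# Hypothetical test (90% sensitive, 98% specific) used in two settings.
high_prevalence = expected_classification(1000, 0.15, 0.90, 0.98)
low_prevalence = expected_classification(1000, 0.02, 0.90, 0.98)

for label, result in [("15% prevalence", high_prevalence),
                      ("2% prevalence", low_prevalence)]:
    print(label, {key: round(value, 2) for key, value in result.items()})
```

With the same test characteristics, false-positive results make up a much larger share of all positive results in the low-prevalence setting, so the balance between the harms of missed cases and the harms of unnecessary treatment shifts with the population in which the test is used.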

These examples highlight another challenge with tubercu- 
losis diagnostic research (and common to much diagnostic 
research), which is the lack of a perfect gold standard with 
which to compare new tests. If our new test is potentially more 
sensitive than the existing test (as might be the case with mo- 
lecular tests, compared with sputum culture), this will affect 
any analysis. The lack of a gold standard often requires a con- 
struct gold standard that comprises information from the ref- 
erence test with additional clinical information and follow-up 
information [68]. Of further concern, discrepancies between
phenotypic and genotypic drug-susceptibility results can be
extremely difficult to interpret, and it is not always clear which
is the more reliable measure of drug resistance [69]. In many
ways, these issues reinforce the need for well-designed clinical
trials because thorough interpretation of the tests may only be
possible with meticulously collected baseline and follow-up
clinical data.

Figure 5. Potential impact of false-positive and false-negative tuberculosis diagnoses in a hypothetical trial comparing a rapid molecular test to tuberculosis culture.

PRACTICAL TRIAL DESIGNS 

If the outcomes of interest are individual-level outcomes (eg, 
treatment initiation and mortality), a clinical trial with individual 
randomization would be the logical and statistically most effi- 
cient design. However, because there will be information re- 
garding the diagnostic performance from the laboratory-based 
evaluation, the question arises, if the test is shown to have 
comparable accuracy to an existing test but has other advan- 
tages (ie, more rapid and/or less invasive), is it ethical to 
conduct an RCT with individual randomization? Critical to this 
decision is whether there is equipoise regarding the clinical 
outcome. Equipoise with regard to the clinical outcomes of a
diagnostic strategy arises, for example, when the consequences of
misdiagnosis are severe (eg, HIV-infected patients who are
misdiagnosed with tuberculosis while dying of another
HIV-related illness) or when failure to diagnose does not lead
to mistreatment or poorer outcomes (eg, patients prescribed
tuberculosis treatment regardless of the test result).

Individual randomization may, however, present consider- 
able logistical challenges in certain health care settings, and for 
this reason, cluster randomized designs may be considered with 
health care units (eg, hospitals, clinics, and mobile teams) as
clusters. Cluster randomized designs are increasingly used in
public health research. The principal reasons for considering
such a design are that the intervention is to be delivered to
groups rather than individuals, that the outcome is to be
measured at a population level, or that contamination between
individuals in the same community who are randomized to
different trial arms needs to be avoided [70]. However, there is
also an acceptance that cluster randomization may be appropriate
in settings where it offers greater logistical convenience than an
individually randomized trial, although cluster RCTs generally
require larger sample sizes and have added challenges in design,
analysis, and ethics [70-72].
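
The sample size penalty of cluster randomization is commonly approximated by the design effect, 1 + (m - 1) x ICC, where m is the average cluster size and ICC is the intracluster correlation coefficient. The sketch below applies this approximation under the simplifying assumption of equal cluster sizes; the individual-level sample size, cluster size, and ICC shown are hypothetical values, not estimates from the tuberculosis literature.

```python
import math


def design_effect(mean_cluster_size, icc):
    """Variance inflation from randomizing clusters rather than individuals."""
    return 1 + (mean_cluster_size - 1) * icc


def clusters_per_arm(n_individual, mean_cluster_size, icc):
    """Clusters needed per arm, given the per-arm size for individual randomization."""
    inflated = n_individual * design_effect(mean_cluster_size, icc)
    return math.ceil(inflated / mean_cluster_size)


# Hypothetical example: 4000 participants per arm would suffice with individual
# randomization; clinics enroll about 200 participants each; ICC is 0.01.
deff = design_effect(200, 0.01)
clinics = clusters_per_arm(4000, 200, 0.01)
print(f"design effect: {deff:.2f}")
print(f"clinics needed per arm: {clinics}")
```

Even a small ICC produces a large inflation when clusters are big, which is why trials randomizing a limited number of large health care units can be much less efficient than the individual-level calculation suggests.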

A further modification of the cluster randomized design is the 
phased implementation or stepped-wedge design [70, 73]. The 
key features of this design are that all clusters receive the in- 
tervention by the end of the trial, and the order in which the 
clusters receive the intervention is decided at random. This is 
particularly appropriate when there is preexisting evidence that 
the intervention may have a beneficial effect and when assigning 
clusters to the control arm for the duration of the trial might be 
ethically unacceptable. This might be particularly suited to 
evaluation of certain diagnostic technologies, for which there is 
evidence from initial diagnostic accuracy studies that suggests 
beneficial effect. 
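
The two key features of the stepped-wedge design (every cluster eventually receives the intervention, and the order is decided at random) can be sketched as a simple allocation schedule. The clinic names, the number of steps, and the random seed below are hypothetical; a real trial would also need to consider stratification, the length of each period, and the analysis model.

```python
import random


def stepped_wedge_schedule(clusters, n_steps, seed=None):
    """Randomly order clusters and assign each a step at which it crosses over.

    Returns a dict mapping each cluster to a list with one entry per period
    (period 0 is the baseline): 0 = standard of care, 1 = new diagnostic.
    """
    rng = random.Random(seed)
    order = list(clusters)
    rng.shuffle(order)  # the crossover order is decided at random
    schedule = {}
    for position, cluster in enumerate(order):
        # Spread clusters as evenly as possible over steps 1..n_steps.
        step = 1 + position * n_steps // len(order)
        schedule[cluster] = [1 if period >= step else 0
                             for period in range(n_steps + 1)]
    return schedule


clinics = ["clinic_A", "clinic_B", "clinic_C", "clinic_D", "clinic_E", "clinic_F"]
schedule = stepped_wedge_schedule(clinics, n_steps=3, seed=1)
for clinic in clinics:
    print(clinic, schedule[clinic])
```

In the resulting table, every clinic starts under the standard of care, switches once, and finishes with the new diagnostic, so no cluster is held in the control arm for the whole trial.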

If randomization is not deemed to be appropriate or feas- 
ible, alternative prospective trial designs, often termed quasi- 
experimental designs, may still be able to generate evidence on 
the effectiveness of diagnostics [74]. An example would be the 
pre- and postimplementation study in which outcomes are 
measured during a pre-intervention phase and subsequently 
during a postintervention phase. Although the lack of ran- 
domization threatens the internal validity (no firm conclusion 
can be made with regard to the effect of the intervention unless 
the effect size is large), there may conversely be a gain in external 
validity (improved generalizability of findings if fewer patients 
are excluded than in conventional RCTs). 

Retrospective studies may be the only methodology to obtain 
outcome data in circumstances in which a diagnostic is widely 
implemented on the basis of performance characteristics. Such 
pre- and postimplementation analyses have been used in high- 
resource settings to estimate the impact of molecular resistance 
testing on detection and treatment of multidrug-resistant tu- 
berculosis [75, 76]. 

Whether a clinical trial is justified in the evaluation of diag- 
nostics will ultimately depend on the balance between the 
benefit to be gained by accurately establishing the impact of 
a new tool and the costs of running a large clinical trial and 
potentially delaying full-scale implementation of an effective 
intervention. These decisions are not straightforward, and col- 
laboration between scientists and policy makers is vital to de- 
termine when diagnostic trials are necessary. 



CONCLUSIONS 

Recent developments in tuberculosis diagnostics have led to 
much optimism, but we still lack the tools that meet the needs of 
patients in high-burden countries. The next 10-20 years will 
hopefully see further developments in diagnostic technology. 
We need to ensure that the framework for evaluating diagnostic 
tools is best suited to ensuring that the tools with the greatest 
public health impact and cost-effectiveness are implemented 
and that those with minimal impact are developed further or are 
discarded. Diagnostic accuracy studies are an important early 
step in the evaluation process but do not produce sufficient 
evidence to inform public health policies. Well-designed pro- 
spective studies (including RCTs) should be integrated in the 
research pathway to provide reliable information on therapeutic 
impact, patient outcomes, and cost-effectiveness. This new era 
of tuberculosis diagnostics should be accompanied by a new era 
for diagnostic research focused clearly on the evaluation of 
public health impact. 